Derive other indexes directly for binary fuse #54

Merged
lemire merged 1 commit into FastFilter:master from RaduBerinde:other-index-2.5
Jan 26, 2026

@RaduBerinde
Contributor

We manipulate the math and use bit tricks to derive the other two
indexes more efficiently during peeling.
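
To illustrate the idea, here is a minimal Go sketch, not the code from this PR. It assumes the three indexes follow the usual binary fuse layout: `h0` comes from a 64-bit multiply, `h1 = (h0 + L) ^ m1` and `h2 = (h0 + 2L) ^ m2`, where `L` is a power-of-two segment length and `m1`, `m2` are masked hash bits. Under that assumption, knowing any one index and its position is enough to recover the other two with XOR and add/subtract only, skipping the multiply. The names `params`, `allThree` and `otherTwo` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// params holds hypothetical filter parameters.
// segmentLength is assumed to be a power of two, and
// segmentLengthMask is assumed to equal segmentLength - 1.
type params struct {
	segmentLength      uint32
	segmentLengthMask  uint32
	segmentCountLength uint32
}

// allThree computes the three indexes from scratch (assumed layout).
func allThree(p params, hash uint64) [3]uint32 {
	hi, _ := bits.Mul64(hash, uint64(p.segmentCountLength))
	h0 := uint32(hi)
	h1 := (h0 + p.segmentLength) ^ (uint32(hash>>18) & p.segmentLengthMask)
	h2 := (h0 + 2*p.segmentLength) ^ (uint32(hash) & p.segmentLengthMask)
	return [3]uint32{h0, h1, h2}
}

// otherTwo derives the two remaining indexes from the known index idx
// at position found (0, 1 or 2), without the 64-bit multiply.
// The results are returned in cyclic order: (found+1)%3, then (found+2)%3.
func otherTwo(p params, hash uint64, found int, idx uint32) (uint32, uint32) {
	l := p.segmentLength
	m1 := uint32(hash>>18) & p.segmentLengthMask
	m2 := uint32(hash) & p.segmentLengthMask
	switch found {
	case 0:
		// idx is h0.
		return (idx + l) ^ m1, (idx + 2*l) ^ m2
	case 1:
		// Undo the XOR to get h0 + L, then step to the neighbours.
		base := idx ^ m1
		return (base + l) ^ m2, base - l
	default:
		// Undo the XOR to get h0 + 2L.
		base := idx ^ m2
		return base - 2*l, (base - l) ^ m1
	}
}

func main() {
	p := params{segmentLength: 1024, segmentLengthMask: 1023, segmentCountLength: 8 * 1024}
	hash := uint64(0x9E3779B97F4A7C15)
	all := allThree(p, hash)
	a, b := otherTwo(p, hash, 1, all[1])
	fmt.Println(all, a, b)
}
```

The XOR masks only touch the low bits below `L`, so they can be undone exactly; that is what makes the reverse derivation possible in this sketch.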

Apple M1:
```
name                                old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-10         43.8 ± 2%      50.3 ± 3%  +14.88%  (p=0.000 n=8+9)
BinaryFusePopulate/8/n=100000-10        38.6 ± 3%      41.3 ± 1%   +7.09%  (p=0.000 n=9+8)
BinaryFusePopulate/8/n=1000000-10       35.0 ± 4%      36.5 ± 7%   +4.12%  (p=0.013 n=9+10)
BinaryFusePopulate/16/n=10000-10        48.6 ± 4%      48.5 ± 6%     ~     (p=1.000 n=10+10)
BinaryFusePopulate/16/n=100000-10       38.0 ± 3%      41.1 ± 1%   +8.35%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-10      33.8 ± 5%      36.6 ± 2%   +8.14%  (p=0.000 n=10+10)
```

GCE N4D (AMD Turin):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         53.2 ± 3%      57.1 ± 1%   +7.46%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=100000-8        33.0 ± 0%      37.5 ± 1%  +13.38%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=1000000-8       28.5 ± 2%      31.8 ± 2%  +11.59%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=10000-8        53.1 ± 1%      56.2 ± 1%   +5.93%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=100000-8       31.8 ± 1%      37.3 ± 1%  +17.35%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      27.5 ± 1%      30.9 ± 1%  +12.34%  (p=0.000 n=10+10)
```

GCE C4 (Intel Emerald Rapids, turbo boost capped at "all core" max):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         29.2 ± 1%      32.2 ± 1%  +10.00%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=100000-8        27.0 ± 3%      29.8 ± 5%  +10.22%  (p=0.000 n=10+10)
BinaryFusePopulate/8/n=1000000-8       25.6 ± 3%      28.2 ± 5%  +10.27%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=10000-8        28.9 ± 1%      32.0 ± 1%  +10.84%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=100000-8       26.2 ± 1%      28.8 ± 3%  +10.05%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      24.8 ± 2%      26.9 ± 2%   +8.37%  (p=0.000 n=10+10)
```

GCE C4A (Google's Axion ARM64):
```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         45.1 ± 1%      45.1 ± 1%    ~     (p=0.511 n=9+10)
BinaryFusePopulate/8/n=100000-8        39.8 ± 1%      39.4 ± 1%  -0.79%  (p=0.018 n=9+10)
BinaryFusePopulate/8/n=1000000-8       33.9 ± 3%      34.2 ± 3%    ~     (p=0.363 n=10+10)
BinaryFusePopulate/16/n=10000-8        44.0 ± 1%      44.7 ± 1%  +1.54%  (p=0.000 n=9+10)
BinaryFusePopulate/16/n=100000-8       37.4 ± 1%      38.4 ± 1%  +2.75%  (p=0.000 n=10+10)
BinaryFusePopulate/16/n=1000000-8      30.9 ± 5%      32.4 ± 1%  +4.84%  (p=0.000 n=10+10)
```
@lemire
Member

lemire commented Jan 15, 2026

On my todo to review.

lemire merged commit d0b48ae into FastFilter:master on Jan 26, 2026
5 checks passed